Abstract

We study four problems: put $n$ distinguishable/non-distinguishable balls into $k$ non-empty
distinguishable/non-distinguishable boxes randomly. What is the threshold function $k = k(n)$ that makes it almost
sure that no two boxes contain the same number of balls? The non-distinguishable ball problems are very close to the
Erdős-Lehner asymptotic formula for the number of partitions of the integer $n$ into $k$ parts with $k = o(n^{1/3})$.
The problem is motivated by the statistics of an experiment, where we can only tell whether outcomes are identical
or different.

1 Motivation

Consider a generic experiment where the state of a complex system is probed with repeated experiments providing
the outcome sequence $x_1, x_2, \ldots, x_n$. The experimenter can tell only whether two outcomes are identical or
different. So the outcomes can be thought of as a sample of i.i.d. draws from an unknown probability distribution
over a discrete space. The order of the outcomes carries no valuable information for us. We want to understand this
process under the condition that $k$ different outcomes are observed out of $n$ experiments. As an example, think of
a botanist collecting specimens of flowers in a yet unexplored forest. In order to classify observations into species,
all that is needed is an objective criterion to decide whether two specimens belong to the same species or not. More
generally, think of unsupervised data clustering of a series of observations.

In the extreme situation where $k = n$ and all outcomes are observed only once, the classification is not very
informative. At the other extreme, when all the outcomes belong to the same class, the experimenter will think
he/she has discovered some interesting regularity. In general, the information that a set of $n$ repeated experiments
yields is the size $m_i$ of each class (or cluster), i.e. the numbers $m_i := |\{\ell : x_\ell = i\}|$ (with
$\sum_{i=1}^{k} m_i = n$), as the relative frequency of the observations is what allows one to make comparative
statements. We expect that when the number $k$ of classes is large there will be several classes of the same size,
i.e. classes that will not be discriminated by the experiment, whereas when $k$ is small each class will have a
different size.*

This is clearly a problem that can be rephrased in terms of distributions of balls (outcomes) into boxes (classes).
We consider, in particular, the null hypothesis of random placement of balls into boxes. In this framework, the
question we ask is: what is the critical number of boxes $k_c(n)$ such that for $k \ll k_c$ we expect to find that
all boxes contain a different number $m_i$ of balls, whereas for $k \gg k_c$ boxes with the same number of balls will
exist with high probability?

2 Introduction 

Recall the surjective version (no boxes are empty) of the twelvefold way of counting [11, p. 41]: putting $n$
distinguishable/non-distinguishable balls into $k$ non-empty distinguishable/non-distinguishable boxes corresponds
to four basic problems in combinatorial enumeration according to Table 1. Our concern in all four types of problems
is the threshold function $n = n(k)$ that makes it almost sure that no two boxes contain the same number of balls for
a ball placement selected uniformly at random. Although studying all distinct parts is a topical issue for integer
partitions [3] and compositions [4], it is hard to find any corresponding results for surjections and set partitions
except [7]. Our results regarding the threshold functions are summed up in Table 1 in parentheses. We have to
investigate only three problems, since every $k$-partition of an $n$-element set corresponds to exactly $k!$
surjections from $[n]$ to $[k]$, namely those surjections whose inverse image partition is the $k$-partition in
question. Therefore the threshold function for set partitions is the same as the threshold function for surjections.
Also, the threshold function is the same for compositions and partitions, although the number of compositions
corresponding to a partition may vary from 1 to $k!$.
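The correspondence between set partitions and surjections can be checked by brute force on a small case. The sketch below is only an illustration: the function names are ours, and boxes are labelled $0, \ldots, k-1$ instead of $1, \ldots, k$. It groups all surjections $[5] \to [3]$ by their inverse image partition and counts the size of each group.

```python
from collections import Counter
from itertools import product
from math import factorial


def surjections(n, k):
    """Yield every surjection [n] -> [k] as a tuple f, where f[x] is the box of ball x."""
    for f in product(range(k), repeat=n):
        if len(set(f)) == k:  # every box is hit at least once
            yield f


def inverse_image_partition(f, k):
    """Return the set partition {f^{-1}(0), ..., f^{-1}(k-1)}, forgetting the box labels."""
    return frozenset(
        frozenset(x for x, box in enumerate(f) if box == i) for i in range(k)
    )


def surjections_per_partition(n, k):
    """Count, for each k-partition of [n], the surjections that induce it."""
    return Counter(inverse_image_partition(f, k) for f in surjections(n, k))


counts = surjections_per_partition(5, 3)
# 25 set partitions (the Stirling number S(5, 3)), each induced by 3! = 6 surjections
print(len(counts), set(counts.values()), factorial(3))
```

Since all surjections in one group share the same block sizes, the event "all boxes have distinct sizes" has the same probability under a uniform surjection and under a uniform set partition.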

Our proofs use the first and the second moment method, the latter in the following form (due to Chung and Erdős,
see [10], p. 76). For events $A_1, A_2, \ldots, A_N$, the following inequality holds:

$$P\Bigl(\bigcup_{i=1}^{N} A_i\Bigr) \ge \frac{\bigl(\sum_{i=1}^{N} P(A_i)\bigr)^2}{\sum_{i=1}^{N} P(A_i) + \sum_{i \ne j} P(A_i \cap A_j)}. \qquad (2.1)$$

We will always assume

$$n \ge \binom{k+1}{2}, \qquad (2.2)$$

otherwise the parts clearly cannot have distinct sizes. We did not even attempt to obtain a limiting distribution
$f(c)$ when $n$ is $c$ times the threshold function, though such an estimate should be possible to obtain. Although
there are deep asymptotic results on random functions and set partitions based on generating functions (e.g. see
Sachkov [5]), we do not see how to apply generating functions for our threshold problems. Our results corroborate
some formulae of Knessl and Keller [5], who used techniques from applied mathematics to obtain heuristics for
partition asymptotics from basic partition recursions.

                                         k non-empty boxes
                          distinguishable                    non-distinguishable
n balls
  distinguishable         surjections ($n = k^5$)            set partitions ($n = k^5$)
  non-distinguishable     integer compositions ($n = k^3$)   integer partitions ($n = k^3$)

Table 1: Threshold functions for distinct parts for the four surjective cases in the twelvefold way of counting.

*This observation can be made precise in information theoretic terms. The label $X$ of one outcome, taken at random
from the sample, is a random variable whose entropy $H[X]$ quantifies its information content. The size $m_X$ of the
class containing $X$ clearly has a smaller entropy, $H[m] \le H[X]$, by the data processing inequality [2]. When $k$
is small, we expect that $H[X] = H[m]$, whereas when $k$ is large, $H[X] > H[m]$.



3 Threshold function for integer compositions 

Recall that the number of compositions of the integer $n$ into $k$ positive parts is $C(n,k) = \binom{n-1}{k-1}$.
Let $A_{ij}(t)$ denote the event that the $i$th and $j$th parts are both equal to $t$ in a random composition of $n$
into $k$ positive parts. Using the first moment method, it is easy to see that

$$\begin{aligned}
P(\exists \text{ equal parts}) &= P\Bigl(\bigcup_{i<j}\bigcup_{t} A_{ij}(t)\Bigr) \le \sum_{i<j}\sum_{t} P(A_{ij}(t)) = \binom{k}{2}\sum_{t\ge 1}\frac{C(n-2t,k-2)}{C(n,k)} \\
&= \binom{k}{2}\sum_{t\ge 1}\frac{\binom{n-2t-1}{k-3}}{\binom{n-1}{k-1}} \le \binom{k}{2}\frac{\binom{n-2}{k-2}}{\binom{n-1}{k-1}} = \binom{k}{2}\frac{k-1}{n-1} = o(1)
\end{aligned}$$

as $n/k^3 \to \infty$. We make an elementary claim here that we use several times and leave its proof to the reader.

Claim 1 Assume that we have an infinite list of finite sequences of non-negative numbers,
$a_1(n), a_2(n), \ldots, a_{N(n)}(n)$ for $n = 1, 2, \ldots$, such that none of the sequences is identically zero.
Assume that the number of increasing and decreasing intervals of these finite sequences is bounded, and that
$\max_j a_j(n) = o\bigl(\sum_j a_j(n)\bigr)$ as $n \to \infty$. Then, for any fixed $k$, as $n \to \infty$, we have

$$\sum_{i:\ k \mid i} a_i(n) = \Bigl(\frac{1}{k} + o(1)\Bigr)\sum_{i} a_i(n).$$
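A quick numerical illustration of Claim 1, assuming the sequence is the (unimodal) binomial distribution with $n = 600$ and $p = 1/5$; this choice of sequence and the helper name `residue_share` are ours. The sketch computes, in exact integer arithmetic, the share of the total mass carried by the indices divisible by $k$ and compares it with $1/k$.

```python
from math import comb


def residue_share(n, p_num, p_den, k):
    """Fraction of the Binomial(n, p_num/p_den) mass on indices divisible by k.

    Uses exact integer weights comb(n, i) * p_num^i * (p_den - p_num)^(n - i).
    """
    q_num = p_den - p_num
    weights = [comb(n, i) * p_num**i * q_num ** (n - i) for i in range(n + 1)]
    return sum(weights[::k]) / sum(weights)


for k in (2, 3, 5):
    print(k, residue_share(600, 1, 5, k), 1 / k)
```

The same computation with the indices shifted by a fixed residue gives the same limit, which is how the claim is applied to every second or every third term of a sum.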

Using the claim we can get a more precise estimate:

$$\sum_{t\ge 1}\binom{n-2t-1}{k-3} = \Bigl(\frac12 + o(1)\Bigr)\binom{n-2}{k-2}. \qquad (3.3)$$

Next we use (2.1) and (3.3) to show that $P(\exists \text{ equal parts}) \to 1$ as $n/k^3 \to 0$. The numerator of
(2.1) is the square of

$$\sum_{i<j}\sum_{t} P(A_{ij}(t)) = \binom{k}{2}\sum_{t\ge 1}\frac{\binom{n-2t-1}{k-3}}{\binom{n-1}{k-1}} = \Bigl(\frac14 + o(1)\Bigr)\frac{k^3}{n} \qquad (3.4)$$

that grows to infinity. Therefore we can neglect the same term without square in the denominator of (2.1). A
second negligible term in the denominator arises if $i < j$ and $u < v$ make only 3 distinct indices (note that they
must occur with the same $t$). The corresponding sum of the probabilities is estimated by

$$k\binom{k-1}{2}\sum_{t\ge 1}\frac{C(n-3t,k-3)}{C(n,k)} \le \frac{k^3}{2}\sum_{t\ge 1}\frac{\binom{n-3t-1}{k-4}}{\binom{n-1}{k-1}} \le \frac{k^3}{2}\cdot\frac{\binom{n-3}{k-3}}{\binom{n-1}{k-1}} = O\Bigl(\frac{k^5}{n^2}\Bigr). \qquad (3.5)$$

A third negligible term in the denominator arises if all four indices are distinct, but the four corresponding parts
are all the same:

$$\binom{k}{2}\binom{k-2}{2}\sum_{t\ge 1}\frac{C(n-4t,k-4)}{C(n,k)}.$$

This can be estimated by $O(k^7/n^3)$ like the estimate in (3.5), and is similarly negligible. The significant term
in the denominator is

$$\binom{k}{2}\binom{k-2}{2}\sum_{\ell\ge 1}\ \sum_{t\ge 1,\ t\ne \ell}\frac{C(n-2\ell-2t,k-4)}{C(n,k)} \qquad (3.6)$$

corresponding to the case when 2-2 parts are the same. This term will not change asymptotically when we add the
$t = \ell$ cases to the summation. So (3.6) is asymptotically equal to (using Claim 1 again)

$$\frac{k^4}{4}\sum_{\ell\ge 1}\sum_{t\ge 1}\frac{\binom{n-2\ell-2t-1}{k-5}}{\binom{n-1}{k-1}} \sim \frac{k^4}{8}\sum_{\ell\ge 1}\frac{\binom{n-2\ell-2}{k-4}}{\binom{n-1}{k-1}} \sim \frac{k^4}{16}\cdot\frac{\binom{n-3}{k-3}}{\binom{n-1}{k-1}} \sim \frac{k^6}{16n^2}. \qquad (3.7)$$

We conclude that the numerator in (2.1) is asymptotically equal to its denominator (3.7), proving that
$P(\exists \text{ equal parts}) \to 1$ as $n/k^3 \to 0$.
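For small parameters the probability that a uniform random composition has all parts distinct can be computed exactly, which gives a feel for the $n = k^3$ threshold. The following is a minimal sketch, not part of the proof: it counts $k$-element sets of distinct positive integers with sum $n$ by a standard subset-sum recursion, multiplies by $k!$ to count compositions, and divides by $C(n,k) = \binom{n-1}{k-1}$. The function names and the sample values of $n$ and $k$ are our own choices.

```python
from math import comb, factorial


def distinct_part_sets(n, k):
    """Count k-element sets of distinct positive integers with sum n."""
    # ways[j][s]: number of ways to pick j distinct parts, among those processed so far, with sum s
    ways = [[0] * (n + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for part in range(1, n + 1):
        # go through j downwards so that each part is used at most once
        for j in range(min(k, part), 0, -1):
            for s in range(part, n + 1):
                ways[j][s] += ways[j - 1][s - part]
    return ways[k][n]


def prob_all_parts_distinct(n, k):
    """P(all k parts differ) for a uniform random composition of n into k positive parts."""
    return factorial(k) * distinct_part_sets(n, k) / comb(n - 1, k - 1)


k = 5
for n in (40, 125, 500):  # below, at, and above n = k^3
    print(n, round(prob_all_parts_distinct(n, k), 3))
```

With $k$ this small the transition is gradual; the threshold statement concerns the limit $k \to \infty$.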

4 Erdős-Lehner and the threshold function for integer partitions

Let $\mathcal{D}(n,k)$ denote the number of compositions of $n$ into $k$ distinct positive terms. In the previous
section we proved

$$\lim_{n/k^3 \to \infty}\frac{\mathcal{D}(n,k)}{C(n,k)} = 1 \quad\text{and} \qquad (4.8)$$

$$\lim_{n/k^3 \to 0}\frac{\mathcal{D}(n,k)}{C(n,k)} = 0. \qquad (4.9)$$

Let $p(n,k)$ denote the number of partitions of $n$ into $k$ positive terms and $q(n,k)$ denote the number of
partitions of $n$ into $k$ distinct positive terms. For $1 \le x_1 \le x_2 \le \ldots \le x_k$, the well-known
bijection $x_1 + x_2 + \cdots + x_k \mapsto (x_1) + (x_2 + 1) + (x_3 + 2) + \cdots + (x_k + k - 1)$ shows that
$q(n,k) = p\bigl(n - \binom{k}{2}, k\bigr)$.
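The identity $q(n,k) = p\bigl(n - \binom{k}{2}, k\bigr)$ is easy to test on a small range. In this sketch (helper names are ours) $p(n,k)$ is computed by the standard recursion on the smallest part, while $q(n,k)$ is counted independently by direct enumeration of $k$-element subsets.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


@lru_cache(maxsize=None)
def p(n, k):
    """Number of partitions of n into exactly k positive parts."""
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    # either the smallest part is 1 (remove it) or all parts are >= 2 (subtract 1 from each)
    return p(n - 1, k - 1) + p(n - k, k)


def q_brute(n, k):
    """Number of partitions of n into k distinct positive parts, by direct enumeration."""
    return sum(1 for parts in combinations(range(1, n + 1), k) if sum(parts) == n)


mismatches = [
    (n, k)
    for k in range(1, 5)
    for n in range(1, 31)
    if q_brute(n, k) != p(n - comb(k, 2), k)
]
print(mismatches)  # an empty list means the identity held on the whole tested range
```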

A theorem of Erdős and Lehner ([5], see also [3]) asserts that for $k = o(n^{1/3})$, the following asymptotic
formula holds:

$$p(n,k) \sim \frac{1}{k!}C(n,k) = \frac{1}{k!}\binom{n-1}{k-1}. \qquad (4.10)$$

Gupta's proof of the Erdős-Lehner formula ([5], see also [3]) obtains it from the inequalities

$$\frac{1}{k!}C(n,k) \le p(n,k) = q\Bigl(n + \binom{k}{2}, k\Bigr) \le \frac{1}{k!}C\Bigl(n + \binom{k}{2}, k\Bigr) \qquad (4.11)$$

and from the asymptotic equality

$$\binom{n-1}{k-1} \sim \binom{n + \binom{k}{2} - 1}{k-1} \qquad (4.12)$$

of the leftmost and rightmost terms in (4.11), under the assumption $k = o(n^{1/3})$.
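A small numerical check of the sandwich (4.11) and of the Erdős-Lehner asymptotics, under the assumption that $n = 1000$, $k = 3$ is a reasonable stand-in for "$k$ small compared with $n^{1/3}$"; these values and the helper name `partition_table` are ours. The table is filled with the standard recursion $p(m,j) = p(m-1,j-1) + p(m-j,j)$.

```python
from math import comb, factorial


def partition_table(n_max, k):
    """P[j][m] = number of partitions of m into exactly j positive parts (j <= k, m <= n_max)."""
    P = [[0] * (n_max + 1) for _ in range(k + 1)]
    P[0][0] = 1
    for j in range(1, k + 1):
        for m in range(j, n_max + 1):
            # smallest part is 1 (drop it), or all parts >= 2 (subtract 1 from each part)
            P[j][m] = P[j - 1][m - 1] + P[j][m - j]
    return P


n, k = 1000, 3
shift = comb(k, 2)
P = partition_table(n, k)
p_nk = P[k][n]
q_nk = P[k][n - shift]  # q(n, k) = p(n - C(k, 2), k)
lower = comb(n - 1, k - 1) / factorial(k)          # leftmost term of (4.11)
upper = comb(n + shift - 1, k - 1) / factorial(k)  # rightmost term of (4.11)
print(lower, p_nk, upper)
print(p_nk / lower, q_nk / p_nk)
```

Both printed ratios should be close to 1 in this regime: the first reflects (4.10), the second the fact that almost all partitions have distinct parts when $k = o(n^{1/3})$.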

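To get the $n = k^3$ threshold function for integer partitions, we first observe that for $n/k^3 \to 0$,

$$k!\,q(n,k) = \mathcal{D}(n,k),$$
$$\mathcal{D}(n,k) = o(C(n,k)), \text{ from (4.9)},$$
$$\text{and } C(n,k) \le k!\,p(n,k),$$

so that $q(n,k) = o(p(n,k))$.

To do the case $n/k^3 \to \infty$, i.e. $k = o(n^{1/3})$, we use Erdős-Lehner twice and also (4.12):

$$q(n,k) = p\Bigl(n - \binom{k}{2}, k\Bigr) \sim \frac{1}{k!}\binom{n - \binom{k}{2} - 1}{k-1} \sim \frac{1}{k!}\binom{n-1}{k-1} \sim p(n,k).$$

5 Threshold function for surjections

Let $F(n,k)$ denote the number of $[n] \to [k]$ surjections. It is well-known from the Bonferroni inequalities that
$k^n - k(k-1)^n \le F(n,k) \le k^n$ and hence for $n \gg k\log k$, $F(n,k) = (1 + o(1))k^n$ uniformly. Take a random
surjection $f : [n] \to [k]$. Let $A_{ij}(t)$ denote the event that for $1 \le i < j \le k$ we have
$|f^{-1}(i)| = |f^{-1}(j)| = t$. Observe that
$P(A_{ij}(t)) = \binom{n}{2t}\binom{2t}{t}\frac{F(n-2t,k-2)}{F(n,k)}$ and recall
$\binom{2t}{t} \sim \frac{4^t}{\sqrt{\pi t}}$. Observe* that

$$\begin{aligned}
P\bigl(\exists i<j:\ |f^{-1}(i)| = |f^{-1}(j)|\bigr) &= P\Bigl(\bigcup_{i<j}\bigcup_{t} A_{ij}(t)\Bigr) \le \sum_{i<j}\sum_{t} P(A_{ij}(t)) \sim \binom{k}{2}\sum_{t\ge 1}\binom{n}{2t}\frac{4^t}{\sqrt{\pi t}}\cdot\frac{(k-2)^{n-2t}}{k^n} \\
&= \binom{k}{2}\Bigl(1 - \frac{2}{k}\Bigr)^{n}\sum_{t\ge 1}\binom{n}{2t}\Bigl(\frac{2}{k-2}\Bigr)^{2t}\frac{1}{\sqrt{\pi t}}. \qquad (5.13)
\end{aligned}$$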

Let $b(n,i)$ denote the term $\binom{n}{i}p^i(1-p)^{n-i}$ from the binomial distribution with $p = \frac{2}{k}$. It
is easy to see that the core summation in (5.13) is

$$\sum_{t\ge 1} b(n,2t)\frac{1}{\sqrt{\pi t}}.$$

Observe that for this binomial distribution $\mu = np = \frac{2n}{k}$ and $\sigma < \sqrt{np} = \sqrt{2n/k}$. Recall
from [1] the large deviation inequality for sums of independent Bernoulli random variables:

$$P\bigl(|Y - \mu| > \varepsilon\mu\bigr) < 2e^{-c_\varepsilon\mu},$$

where $c_\varepsilon = \min\{-\ln(e^{\varepsilon}(1+\varepsilon)^{-(1+\varepsilon)}),\ \varepsilon^2/2\}$. We select
$\varepsilon$ so that, for sufficiently large $n$, $c_\varepsilon = \varepsilon^2/2$. Set $A = (1-\varepsilon)\mu$ and
$B = (1+\varepsilon)\mu$. As $[A,B]$ includes the range where the normal convergence takes place,
$\sum_{A\le i\le B} b(n,i) \sim 1$. By Claim 1, $\sum_{A\le 2t\le B} b(n,2t) \sim \frac12$. Also, if
$A \le 2t \le B$, then $t \sim \frac{n}{k}$. By the large deviation inequality above and $k < \sqrt{2n}$ from (2.2),
we obtain



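$$\sum_{t:\ 2t\notin[A,B]} b(n,2t) < 2e^{-c_\varepsilon\mu},$$

and combining with Claim 1 we obtain

$$\sum_{t\ge 1,\ 2t\notin[A,B]} b(n,2t)\frac{1}{\sqrt{\pi t}} = o\Bigl(\sqrt{\frac{k}{n}}\sum_{A\le 2t\le B} b(n,2t)\Bigr).$$

Putting together these arguments:

$$\sum_{t\ge 1} b(n,2t)\frac{1}{\sqrt{\pi t}} = (1+o(1))\sqrt{\frac{k}{\pi n}}\sum_{A\le 2t\le B} b(n,2t) = \Bigl(\frac12 + o(1)\Bigr)\sqrt{\frac{k}{\pi n}}.$$

We obtain an asymptotic formula for the upper bound with

$$\sum_{i<j}\sum_{t} P(A_{ij}(t)) = (1+o(1))\frac{k^{5/2}}{4\sqrt{\pi n}}, \qquad (5.14)$$

which goes to zero as $n/k^5 \to \infty$.

*Note that for large $t$ (i.e. $n - 2t = O(k\ln k)$) the approximation for $P(A_{ij}(t))$ is not accurate. The same
problem occurs for small $t$ ($t = O(1)$), because of the estimate of $\binom{2t}{t}$. The corresponding terms,
however, are negligible both in the sum of probabilities and in (5.13).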
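The quantities of this section can be evaluated exactly for moderate $n$ and $k$. The sketch below is an illustration with sample values $n = 600$, $k = 6$ and helper names of our own choosing. It computes $F(n,k)$ by inclusion-exclusion, checks the Bonferroni bounds, and compares the exact first moment $\sum_{i<j}\sum_t P(A_{ij}(t))$ with the approximation $\binom{k}{2}\cdot\frac12\sqrt{k/(\pi n)}$ that results from $\binom{2t}{t} \sim 4^t/\sqrt{\pi t}$, $t \sim n/k$, and half of the binomial mass lying on even indices.

```python
from math import comb, pi, sqrt


def surjection_count(n, k):
    """F(n, k): the number of surjections [n] -> [k], by inclusion-exclusion."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))


def expected_equal_pairs(n, k):
    """Exact sum over i < j and t >= 1 of P(A_ij(t)) for a uniform random surjection."""
    total = sum(
        comb(n, 2 * t) * comb(2 * t, t) * surjection_count(n - 2 * t, k - 2)
        for t in range(1, n // 2 + 1)
    )
    return comb(k, 2) * total / surjection_count(n, k)


n, k = 600, 6
F = surjection_count(n, k)
print(k**n - k * (k - 1) ** n <= F <= k**n)  # Bonferroni bounds
exact = expected_equal_pairs(n, k)
approx = comb(k, 2) * 0.5 * sqrt(k / (pi * n))
print(exact, approx, exact / approx)
```

For fixed small $k$ the factor $\binom{k}{2}$ differs noticeably from $k^2/2$, which is why the comparison keeps the binomial coefficient instead of its asymptotic form.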

Next we use (2.1) and (5.14) to show that $P\bigl(\exists i<j:\ |f^{-1}(i)| = |f^{-1}(j)|\bigr) \to 1$ as
$n/k^5 \to 0$. The numerator of (2.1) is the square of (5.14) that grows to infinity. Therefore we can neglect the
same term without square in the denominator of (2.1).

A second negligible term in the denominator arises if $i < j$ and $u < v$ make only 3 distinct indices (note that
they must occur with the same $t$). The corresponding sum of the probabilities is estimated by

$$k\binom{k-1}{2}\sum_{t\ge 1}\binom{n}{3t}\frac{(3t)!}{(t!)^3}\cdot\frac{F(n-3t,k-3)}{F(n,k)} = O\Bigl(k^3\sum_{t\ge 1}\binom{n}{3t}\frac{(3t/e)^{3t}\sqrt{6\pi t}}{(t/e)^{3t}(2\pi t)^{3/2}}\cdot\frac{(k-3)^{n-3t}}{k^n}\Bigr). \qquad (5.15)$$

We have from the Binomial Theorem $\sum_{t}\binom{n}{t}\bigl(\frac{3}{k-3}\bigr)^{t} = \bigl(\frac{k}{k-3}\bigr)^{n}$.
Working with every third term in a binomial distribution with $p = \frac{3}{k}$ like we worked above with every
second, one obtains for (5.15) the upper bound $O(k^4/n)$, which is negligible compared to the square of (5.14) as
$n/k^5 \to 0$.

A third negligible term in the denominator arises if $i < j$ and $u < v$ are 4 distinct indices, but the
corresponding parts (set sizes) are all equal. The corresponding term is

$$\binom{k}{2}\binom{k-2}{2}\sum_{t\ge 1}\binom{n}{4t}\frac{(4t)!}{(t!)^4}\cdot\frac{F(n-4t,k-4)}{F(n,k)}.$$

This is easily estimated by $O(k^{11/2}/n^{3/2})$ like the estimate in (5.15), and is similarly negligible compared
to the square of (5.14) as $n/k^5 \to 0$. The significant term in the denominator is

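$$\binom{k}{2}\binom{k-2}{2}\sum_{\ell\ge 1}\ \sum_{t\ge 1,\ t\ne\ell}\binom{n}{2\ell}\binom{2\ell}{\ell}\binom{n-2\ell}{2t}\binom{2t}{t}\frac{F(n-2\ell-2t,k-4)}{F(n,k)} \qquad (5.16)$$

corresponding to the cases when for 4 distinct indices 2-2 parts (set sizes) are the same. The significant term will
not change asymptotically when we add the $t = \ell$ cases to the summation.

We do not repeat below the arguments about the binomial distribution which we went through before. So (5.16) is
asymptotically equal to

$$\frac{k^4}{4}\sum_{\ell\ge 1}\binom{n}{2\ell}\frac{4^{\ell}}{\sqrt{\pi\ell}}\sum_{t\ge 1}\binom{n-2\ell}{2t}\frac{4^{t}}{\sqrt{\pi t}}\cdot\frac{(k-4)^{n-2\ell-2t}}{k^n} \qquad (5.17)$$

$$= \frac{k^4}{4}\sum_{\ell\ge 1}\binom{n}{2\ell}\frac{4^{\ell}}{\sqrt{\pi\ell}}\cdot\frac{(k-4)^{n-2\ell}}{k^n}\sum_{t\ge 1}\binom{n-2\ell}{2t}\Bigl(\frac{2}{k-4}\Bigr)^{2t}\frac{1}{\sqrt{\pi t}} \qquad (5.18)$$

$$\sim \frac{k^4}{4}\sum_{\ell\ge 1}\binom{n}{2\ell}\frac{4^{\ell}}{\sqrt{\pi\ell}}\cdot\frac{(k-4)^{n-2\ell}}{k^n}\cdot\frac12\Bigl(1 + \frac{2}{k-4}\Bigr)^{n-2\ell}\sqrt{\frac{k-2}{\pi(n-2\ell)}} \qquad (5.19)$$

$$\sim \frac{k^4}{8}\sqrt{\frac{k}{\pi n}}\sum_{\ell\ge 1}\binom{n}{2\ell}\frac{4^{\ell}}{\sqrt{\pi\ell}}\cdot\frac{(k-2)^{n-2\ell}}{k^n} \qquad (5.20)$$

$$\sim \frac{k^4}{8}\sqrt{\frac{k}{\pi n}}\cdot\frac12\sqrt{\frac{k}{\pi n}} = \frac{k^5}{16\pi n}. \qquad (5.21)$$

We conclude that the numerator in (2.1), which is (5.14) squared in our setting, is asymptotically equal to its
denominator (5.17), proving that $P\bigl(\exists i<j:\ |f^{-1}(i)| = |f^{-1}(j)|\bigr) \to 1$ as $n/k^5 \to 0$.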